Assessing Deduplication and Data Linkage Quality: What to Measure?
Abstract
Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when deduplicating or linking very large data sets. Different measures have been used to characterise the quality of data linkage algorithms. This paper presents an overview of the issues involved in measuring deduplication and data linkage quality, and it is shown that measures in the space of record pair comparisons can produce deceptive accuracy results. Various measures are discussed and recommendations are given on how to assess deduplication and data linkage quality.
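The warning about deceptive accuracy in the record pair comparison space can be illustrated with a small sketch. The abstract gives no formulas, so the standard confusion-matrix definitions of accuracy, precision, recall, and F-measure are assumed here; the function name and all counts are hypothetical and chosen only for illustration.

```python
def pair_space_measures(n_records, true_matches, tp, fp):
    """Quality measures over all record pairs when deduplicating one data set.

    n_records    -- number of records in the data set (hypothetical input)
    true_matches -- number of record pairs that truly refer to the same entity
    tp, fp       -- true and false positives reported by the linkage classifier
    """
    # Deduplication compares every record with every other record once.
    total_pairs = n_records * (n_records - 1) // 2
    fn = true_matches - tp                 # true matches that were missed
    tn = total_pairs - tp - fp - fn        # everything else is a true negative

    accuracy = (tp + tn) / total_pairs
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0

    return {
        "total_pairs": total_pairs,
        "true_negatives": tn,
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical run: 10,000 records, 5,000 true matching pairs,
# and a classifier that finds only half of them with 2,500 false alarms.
m = pair_space_measures(n_records=10_000, true_matches=5_000, tp=2_500, fp=2_500)
for name, value in m.items():
    print(f"{name}: {value}")
```

In this hypothetical run the classifier misses half of the true matches and half of its reported matches are wrong, yet accuracy stays above 0.999 because almost all of the roughly 50 million record pairs are true negatives. Precision, recall, and F-measure ignore the true negatives and so expose the weak result.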
Similar resources
Quality and Complexity Measures for Data Linkage and Deduplication
Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined an...
An Efficient way of Record Linkage System and Deduplication using Indexing techniques, Classification and FEBRL Framework
Record linkage is an important process in data integration, used for merging, matching, and removing duplicates across several databases that refer to the same entities. Deduplication is the process of removing duplicate records in a single database. In recent years, data cleaning and standardization have become important processes in data mining tasks. Due to the complexity of today’s databases, fin...
Assessing Nonresponse Bias and Measurement Error Using Statistical Matching
The estimation of nonresponse bias and measurement error share the problem of usually not having a criterion to assess the quality of the estimate. Nonresponse bias analysis often uses responders within the survey sample who are in some way similar to nonresponders to estimate the potential bias. This depends on the variables within the survey being related to both the likelihood of responding ...
An Evaluation Framework For Data Quality Tools
Data quality is a major concern for large organizations, and software companies are offering a growing number of tools that address it. The scope of these tools is moving from specific applications (deduplication, address normalization, etc.) to a more global perspective integrating all areas of data quality (profiling, rule detection, ...). A framework is needed to help managers to ch...
Probabilistic Data Generation for Deduplication and Data Linkage
In many data mining projects the data to be analysed contains personal information, like names and addresses. Cleaning and preprocessing of such data likely involves deduplication or linkage with other data, which is often challenged by a lack of unique entity identifiers. In recent years there has been an increased research effort in data linkage and deduplication, mainly in the machine learni...